Inference

Class notes

  • Class survey is currently at 2,283 responses (!)

  • I’ll be doing some cleanup this week and then sharing it with you all over the weekend.

  • Be on the lookout for another survey analysis assignment.

Class notes

  • We’ve only got around a third of your grade accounted for so far, so don’t get complacent.

  • Pay attention to your R script. You should be submitting a .R file with the code you used to answer your questions. Not a screenshot.

Refresher on the CLT

We’ll be going back over the chapter 6 material we covered before the break.

Review of some concepts

Mean (\(\mu\))

The (arithmetic) mean or average of a variable is the sum of every element divided by the number of observations

Standard deviation (\(\sigma\))

The standard deviation is a measure of dispersion. It tells us how much observations vary around the mean.

Review of some concepts

Standardization:

We can “mean-center” a variable by subtracting its mean from each observation. Then we can divide each observation by the standard deviation. This will give us a variable with a mean of zero and standard deviation of 1. (We’ll often refer to this standardized variable as “Z” or a “Z-score”.)

\[ \text{Z score} = \frac{\text{Deviation from the mean}}{\text{Standard Deviation}} \]

Z-Standardization

\[ \text{Z score} = \frac{\text{Deviation from the mean}}{\text{Standard Deviation}} = \frac{x - \mu}{\sigma} \]

Shyanne Sellers is 74 inches tall

  • Average female height: 64.5 inches
  • Standard deviation of female height: 2.5 inches
  • \(74 - 64.5 = 9.5\) inches above average. So Z \(= 9.5/2.5 = +3.8\)
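The same arithmetic in R:

```r
# Z-score for Shyanne Sellers' height, using the figures above
x     <- 74    # her height in inches
mu    <- 64.5  # average female height
sigma <- 2.5   # standard deviation of female height
z     <- (x - mu) / sigma
z
# [1] 3.8
```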

Z-Standardization

Note that we can convert back and forth between these two values without losing anything:

\[ Z = \frac{x - \mu}{\sigma} \]

\[ x = \mu + Z \times \sigma \]

So we can standardize any continuous variable to make it have a mean of zero and a standard deviation of one, and then transform it back later.
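A quick sketch in R (with simulated data, just for illustration) shows that standardizing and back-transforming is lossless:

```r
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)   # a made-up continuous variable

z <- (x - mean(x)) / sd(x)            # standardize: subtract mean, divide by SD
round(mean(z), 10)                    # 0 (up to rounding)
sd(z)                                 # 1

x_back <- mean(x) + z * sd(x)         # reverse the transformation
all.equal(x, x_back)                  # TRUE: nothing was lost
```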

Distributions

Probability distribution

A probability distribution is a function that describes the probability of different random values of a variable or statistic. If I know the distribution and its parameters for something, then I can make predictions about the probability of some outcome using the probability density function for that distribution.

Terms: Sampling distribution

Sampling distribution

A sampling distribution is just a probability distribution for a sample statistic like the sample average or proportion. It answers the question: “how would the averages/proportions/medians etc. be distributed if I could take an infinite number of random samples from this population and calculate the sample statistic for each one?”

Terms: Error

Error

Error is the difference between a sample statistic and the population parameter. If the population mean is 1.6 and my sample mean is 1, then my error is \(1 - 1.6 = -.6\). And: if our sampling strategy is unbiased, the sampling distribution should have an average error of 0.

Terms: Standard error

Standard Error

Standard error is the standard deviation of sample means. This will depend partly on the size of our samples and partly on the amount of variation in the population (increasing sample size will decrease the standard error, increasing variance will increase the standard error.)

Sampling Distributions

Drawing one sample from a probability distribution:

set.seed(100) # makes it so we can replicate a random process
x<-runif(n = 250, min = -2, max= 10)
hist(x)

Drawing 1,000 samples from that distribution, and calculating the mean from each one

set.seed(100) # makes it so we can replicate a random process
# replicate 1000 times and calculate means
x_mu<-replicate(1000, mean(runif(n = 250, min = -2, max= 10)))
hist(x_mu)

The normal distribution

  • The normal distribution is a probability distribution that produces data that looks like this:
  • The normal distribution has two parameters: the mean (\(\mu\)) and standard deviation (\(\sigma\)). If we know these parameters we can calculate the area under any set of points in this curve.
  • So, we can use this to make predictions like “what % of observations would be 2 or more Z scores above the mean?”

The normal distribution


Z-standardization makes this even easier because it ensures we’re always talking about \(\mu=0\) and \(\sigma=1\).

So, long before computers, we could get the probability of an observation by calculating its z-score and then looking it up in a table:

instant migraine!


The normal distribution

…but we can also just use R:

Get the cumulative probability of a value \(\leq Z\):

pnorm(1.96)
[1] 0.9750021

Get the Z score for the 97.5th percentile:

qnorm(.975)
[1] 1.959964

The normal distribution in nature

  • Many variables, like human height, naturally resemble a normal distribution
  • So, we could estimate the % of adult women who are taller than Shyanne Sellers using her Z score of 3.8
  • Very few!
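We can check the “very few!” claim in R. pnorm() can take the raw value along with the mean and SD, or the Z score directly:

```r
# P(height > 74), assuming height ~ Normal(64.5, 2.5)
pnorm(74, mean = 64.5, sd = 2.5, lower.tail = FALSE)

# identical answer from the Z score
pnorm(3.8, lower.tail = FALSE)
# about .00007, or roughly 7 in 100,000
```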

The normal distribution and the CLT

Height naturally follows a normal distribution, but many variables don’t. However, the central limit theorem tells us that sampling distributions are approximately normal even if population distributions are non-normal.

Central Limit Theorem

As the number of repeated samples approaches infinity, the sampling distribution of a sample mean will converge toward a normal distribution centered on the population mean.

Moreover, the standard deviation of this sampling distribution (the standard error) will be \(\frac{\sigma}{\sqrt{n}}\). (The population standard deviation divided by the square root of the sample size)

The Central Limit Theorem

When I take a decent-sized random sample of a continuous variable from a population with a fixed mean:

  • I know the sample mean (\(\bar{x}\)) is drawn from a normal distribution.

  • I know the mean (\(\mu\)) of this sampling distribution is equal to the population mean.

  • I know the standard error of this sampling distribution is equal to \(\frac{\sigma}{\sqrt{n}}\)

    • So: I know that 95% of my sample averages or proportions will be within \(\approx 1.96\) standard errors of the correct answer (and any number of other probabilities based on the normal curve)
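A simulation bears all three claims out. This sketch reuses the uniform distribution from the earlier slides, where the population mean is 4 and \(\sigma = 12/\sqrt{12}\):

```r
set.seed(100)
n     <- 250
mu    <- (10 + -2) / 2          # mean of Uniform(-2, 10): 4
sigma <- (10 - (-2)) / sqrt(12) # SD of Uniform(-2, 10)

# 5,000 sample means from samples of size 250
x_mu <- replicate(5000, mean(runif(n, min = -2, max = 10)))

mean(x_mu)        # very close to the population mean of 4
sd(x_mu)          # very close to the theoretical standard error...
sigma / sqrt(n)   # ...which is about 0.219

# share of sample means within 1.96 standard errors of mu: about .95
mean(abs(x_mu - mu) <= 1.96 * sigma / sqrt(n))
```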

Margin of error for a proportion

In the 2020 Census, \(46.3\%\) of households were headed by a married couple.

What is the probability that a random sample of 1,000 households would be over or under by \(3\%\)?

Margin of error for a proportion

Stated differently: what % of sample errors would fall outside \(\pm .03\) of the population proportion?

Z-score of the error

Better yet: we can think about this in terms of the Z-score for an error of \(\pm .03\), then calculate the probability using the standard normal distribution.

\[Z = \frac{\pm\text{.03}}{\text{standard error}}\]

Z-score of the error

So, all we need here is the standard error. We know it’s \(\frac{\sigma}{\sqrt n}\), but what is \(\sigma\), the population standard deviation?

For a proportion this is always \(\sqrt{p (1-p)}\) so \(\sqrt{.463 \times .537} = .495\)

So: \(SE = \frac{.495}{\sqrt{1,000}} = 0.016\)

So the Z score for an error of +.03 would be: \(Z = \frac{+.03}{\text{.016}} = 1.875\)

And the Z-score for an error of -.03 is: \(Z = \frac{-.03}{\text{.016}} = -1.875\)
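Putting the whole calculation in R (note that keeping the unrounded standard error gives \(Z \approx 1.90\) rather than the 1.875 you get after rounding the SE to .016):

```r
p <- .463                         # population proportion (2020 Census)
n <- 1000                         # sample size

se <- sqrt(p * (1 - p)) / sqrt(n) # standard error of a sample proportion
se                                # about .0158

z <- .03 / se                     # Z score for an error of +.03
z                                 # about 1.90
```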

Error for a proportion

What’s the probability of an error \(\geq .03\)?

pnorm(1.875, lower.tail = FALSE)
[1] 0.03039636

Error for a proportion

What’s the probability of an error that’s \(\geq 3\%\) lower than the correct estimate?

pnorm(-1.875)
[1] 0.03039636

Error for a proportion

set.seed(1000)
n<-1000                                           # the size of one sample
nsample<-50000                                    # the number of samples to draw
pop_prop <- .463                                  # the proportion of "successes" in the population
sample_props<-rbinom(nsample, n, pop_prop)/n      # drawing the samples
errors<- sample_props - pop_prop                  # calculating the errors
quantile(errors, c(.03, .97))                     # the 3rd and 97th percentile
   3%   97% 
-0.03  0.03 

Confidence Intervals

Instead of asking “how likely is an error of ___ size?” confidence intervals ask “what’s a range of values that is likely to contain the population parameter?”

In other words: what’s the range that would contain ____% of the sample means?

Confidence Intervals

  1. Pick a desired confidence range (90%, 95%, or 99% are common)
  2. Find the Z score associated with that value
  3. Multiply the standard error by the Z score
  4. Subtract this value to get the lower boundary and add it to get the upper boundary.

\[ \text{CI} = \text{Sample Mean} \pm (Z \times \text{standard error}) \]

Confidence intervals

  • The Z score for a 95% confidence interval is 1.96.

  • Using the previous example, if I estimate from a sample that 43% of households are headed by a married couple:

\[ \text{CI} = \text{Sample Mean} \pm (Z \times \text{standard error}) \]

The upper 95% confidence interval is:

\[ \text{Upper } 95\% \text{ CI} = .43 + (1.96 \times \text{.016}) = .46 \]

The lower 95% confidence interval is:

\[ \text{Lower } 95\% \text{ CI} = .43 - (1.96 \times \text{.016}) = .40 \]
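In R, using the slide’s estimate of 43% and standard error of .016:

```r
p_hat <- .43           # sample estimate from the slide
se    <- .016          # standard error from the slide
z     <- qnorm(.975)   # 1.959964

ci <- p_hat + c(-1, 1) * z * se  # lower and upper 95% bounds
ci
# [1] 0.3986406 0.4613594
```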

Confidence Intervals Z scores

\[ 90\% \text{ CI} = \text{Sample Mean} \pm (1.64 \times \text{standard error}) \]

\[ 95\% \text{ CI} = \text{Sample Mean} \pm (1.96 \times \text{standard error}) \]

\[ 99\% \text{ CI} = \text{Sample Mean} \pm (2.58 \times \text{standard error}) \]
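Each multiplier is just the Z score that leaves half of the remaining probability in each tail, so we can recover them with qnorm():

```r
qnorm(.95)    # 90% CI multiplier: 1.644854 (~1.64)
qnorm(.975)   # 95% CI multiplier: 1.959964 (~1.96)
qnorm(.995)   # 99% CI multiplier: 2.575829 (~2.58)
```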

Confidence Intervals

Technically: 95% confidence means: “if I repeated this survey an infinite number of times and recalculated the CI for each one, 95% of my calculated confidence intervals would contain the actual population value.”

More intuitively: “We’re 95% certain that this interval contains the population value.”

Estimating \(\sigma\)

How do we estimate a standard error for a variable when the population standard deviation is unknown?

If the sample is large enough we can use the sample standard deviation itself to estimate the population standard deviation after a tiny correction:

Population standard deviation:

\[ \sigma = \sqrt{\frac{\sum_{i} (x_{i} - \mu)^2}{n}} \]

Sample standard deviation:

\[ s = \sqrt{\frac{\sum_{i} (x_{i} - \bar{x})^2}{n-1}} \]

Estimating \(\sigma\)

\[ s = \sqrt{\frac{\sum_{i} (x_{i} - \bar{x})^2}{n-1}} \]

R performs this correction automatically when you use the sd() function, so you won’t need to do it manually. With that substitution made, the basic intuition for confidence intervals stays the same.
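A quick check with a small made-up sample confirms that sd() uses the \(n-1\) denominator:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # a small hypothetical sample
n <- length(x)

# manual sample SD with the n - 1 correction
manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

all.equal(manual, sd(x))  # TRUE: sd() already applies the correction
```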

Estimating \(\sigma\)

How large is large enough to estimate a standard deviation from a sample?

Generally, around \(n=30\), the sample standard deviation will be close enough to the population standard deviation. When we have a sample smaller than \(n=30\), we will have to adopt a different method…

Review

If I know:

  • that some variable follows a normal distribution

  • and I know the mean

  • and I know the standard deviation

Then I can predict the probability of drawing some hypothetical value from that distribution.

Review

According to the central limit theorem, the sampling distribution of sample means will be:

  • normally distributed and centered around the population mean (\(\mu\))

  • with a standard deviation \(=\frac{\sigma}{\sqrt{n}}\)

We can use this knowledge, coupled with what we know about the normal distribution, to predict the probability of an error of a given size.

Review

In repeated sampling:

  • 90% of sample means will be within 1.64 standard errors of the true population mean

  • 95% of sample means will be within 1.96 standard errors of the true population mean

  • 99% of sample means will be within 2.58 standard errors of the true population mean

So: we don’t know whether any one sample is correct, but we do know that the probability of getting an estimate that is wrong by 3 or more standard errors is very low.

Review

We can use this insight to construct a confidence interval (think: margin of error) around a sample mean:

\[ 90\% \text{ CI} = \text{Sample Mean} \pm (1.64 \times \text{standard error}) \]

\[ 95\% \text{ CI} = \text{Sample Mean} \pm (1.96 \times \text{standard error}) \]

\[ 99\% \text{ CI} = \text{Sample Mean} \pm (2.58 \times \text{standard error}) \]

Review

But remember that to estimate the standard error, we need to know the population standard deviation:

\[ \text{SE} = \frac{\sigma}{\sqrt{n}} \]

We can generally estimate this accurately from the sample standard deviation, but only if we have a large enough sample.

Notation

We’ll use some shorthand notation to distinguish population parameters from sample estimates:

Measure              Population             Sample
Mean                 \(\mu\) (“mu”)         \(\bar{X}\) (“x-bar”)
Standard Deviation   \(\sigma\) (“sigma”)   \(s\) (“s”)
Size                 \(N\)                  \(n\)

Try it out

Head over to this link and follow the instructions:

https://forms.gle/y6824v2Mb12XExvz8

R-script template here:

https://raw.githubusercontent.com/Neilblund/GVPT-201-Site/refs/heads/main/R%20code/clt_with_dice.R

Estimating \(\sigma\) for a small sample

What if we don’t have a lot of observations?

(prepare for a slight digression)

Pearson and Fisher

Much of modern statistical science comes from these two…

Karl Pearson in 1910


Ronald Fisher in 1913


William Sealy Gosset

  • “Head experimental brewer” at one of the first industrial scale breweries in the world

  • Studied with Karl Pearson for two years, and wanted to use statistical methods to identify high-yield barley

  • But Gosset found that estimating \(\sigma\) from \(s\) didn’t work for his relatively small samples.

  • Gosset (publishing pseudonymously under the name “Student”) found that you needed to adjust for small sample sizes when using \(s\) to estimate \(\sigma\)

Estimating \(\sigma\)

We’re trying to estimate the population standard deviation using the sample standard deviation. In large samples, these will be very close most of the time. In very small samples, \(s\) is biased toward zero, so we would tend to underestimate our uncertainty.

Student’s T Distribution

  • Remember that the normal distribution has just two parameters: mean and standard deviation.
  • “Student’s T” distribution has an extra parameter - degrees of freedom - which is equal to the sample size minus the number of parameters we’re trying to estimate.

    • (For right now, this just means \(n-1\) because we’re just using the sample to estimate the population mean.)
  • When df is low (less than 30), the T distribution has “fatter tails” than the normal.
  • Around \(df \geq 30\), it is basically indistinguishable from the normal.
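You can watch the fat tails shrink by comparing critical T scores at increasing degrees of freedom to the normal’s 1.96 (a quick sketch):

```r
# 97.5th percentile of T at several degrees of freedom
crit <- sapply(c(5, 10, 30, 100), function(df) qt(.975, df = df))
round(crit, 3)   # 2.571 2.228 2.042 1.984 -- shrinking toward...
qnorm(.975)      # ...1.959964
```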

Student’s T Distribution

The basic logic for confidence intervals remains the same:

  1. Pick a desired confidence range (90%, 95%, or 99% are common)
  2. Find the T score associated with that value given your degrees of freedom (n-1 right now)
  3. Multiply the standard error by the T score
  4. Subtract this value to get the lower boundary and add it to get the upper boundary.

\[ \text{CI} = \text{Sample Mean} \pm (T \times \text{standard error}) \]

Student’s T Distribution

If I have 20 observations, then I have \(20 - 1 = 19\) degrees of freedom, and my critical T score for a 95% confidence interval around the mean is:

qt(.975, df=19)
[1] 2.093024

\[ 95\% \text{ CI} = \text{Sample Mean} \pm (2.09 \times \text{standard error}) \]

Student’s T Distribution

Using the normal distribution

library(RCPA3)

xbar <- mean(states$attend.pct)

se <- sd(states$attend.pct)/sqrt(50)

# Lower 95% bound
xbar - qnorm(.975) * se
[1] 36.34258
# Upper 95% bound
xbar + qnorm(.975) * se
[1] 41.53742

Using T:

# Lower 95% bound
xbar - qt(.975, 49) * se
[1] 36.27684
# Upper 95% bound
xbar + qt(.975, 49) * se
[1] 41.60316

Student’s T Distribution

  • Adjusts for bias in \(s\) in small sample sizes.

  • When the sample size is large, it barely matters. When the sample size is small, it can prevent us from having confidence intervals that are too small.

  • It’s also not necessary for proportions, because we can get the standard deviation directly from \(\sqrt{p(1-p)}\)

Notes and caveats

  • All of this stuff applies to quantifying random error, it doesn’t help us quantify bias.

  • If our sample is biased, we can still calculate confidence intervals, but they only tell us what would happen in repeated surveys with the same biased sampling process.

Notes and caveats

  • When it comes to weighted surveys, we’ll need to make additional adjustments for the effect of weighting on our amount of uncertainty.

  • When it comes to the class survey, we’ll still calculate things like confidence intervals, but we’ll just need to acknowledge that we’re not really able to make direct inferences about the population.

Next week

What if I find that two samples have different means? How do I know when this is random sampling error? How do I know when it’s the result of a systematic difference?